Unconstrained handwritten document retrieval
Internal identifier: 000538 (Main/Exploration); previous: 000537; next: 000539
Authors: Huaigu Cao [United States]; Venugopal Govindaraju [United States]; Anurag Bhardwaj [United States]
Source:
- International journal on document analysis and recognition (Print) [ISSN 1433-2833]; 2011.
French descriptors
- Pascal (Inist)
- Caractère manuscrit, Recherche documentaire, Recherche information, Système information, Interrogation base donnée, Reconnaissance optique caractère, Texte, Linguistique, Langage naturel, Systématique, Reconnaissance caractère, Analyse image, Réseau web, Défaut, Evaluation performance, Autogénération mutuelle, Métrique, Segmentation, Espace vectoriel, Modélisation, Reconnaissance écriture.
- Wicri :
- topic : Recherche documentaire, Linguistique.
English descriptors
- KwdEn :
- Bootstrapping, Character recognition, Database query, Defect, Document retrieval, Handwriting recognition, Image analysis, Information retrieval, Information system, Linguistics, Manuscript character, Metric, Modeling, Natural language, Optical character recognition, Performance evaluation, Segmentation, Taxonomy, Text, Vector space, World wide web.
Abstract
With the ever-increasing growth of the World Wide Web, there is an urgent need for an efficient information retrieval system that can search and retrieve handwritten documents when presented with user queries. However, unconstrained handwriting recognition remains a challenging task with inadequate performance thus proving to be a major hurdle in providing robust search experience in handwritten documents. In this paper, we describe our recent research with focus on information retrieval from noisy text derived from imperfect handwriting recognizers. First, we describe a novel term frequency estimation technique incorporating the word segmentation information inside the retrieval framework to improve the overall system performance. Second, we outline a taxonomy of different techniques used for addressing the noisy text retrieval task. The first method uses a novel bootstrapping mechanism to refine the OCR'ed text and uses the cleaned text for retrieval. The second method uses the uncorrected or raw OCR'ed text but modifies the standard vector space model for handling noisy text issues. The third method employs robust image features to index the documents instead of using noisy OCR'ed text. We describe these techniques in detail and also discuss their performance measures using standard IR evaluation metrics.
Affiliations:
- United States
- Massachusetts, New York State
- Buffalo (New York)
- State University of New York, State University of New York at Buffalo
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000124
- to stream PascalFrancis, to step Curation: 000649
- to stream PascalFrancis, to step Checkpoint: 000092
- to stream Main, to step Merge: 000544
- to stream Main, to step Curation: 000538
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Unconstrained handwritten document retrieval</title>
<author><name sortKey="Huaigu Cao" sort="Huaigu Cao" uniqKey="Huaigu Cao" last="Huaigu Cao">HUAIGU CAO</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Raytheon BBN Technologies</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Govindaraju, Venu" sort="Govindaraju, Venu" uniqKey="Govindaraju V" first="Venu" last="Govindaraju">Venugopal Govindaraju</name>
<affiliation wicri:level="4"><inist:fA14 i1="02"><s1>Department of Computer Science and Engineering, University at Buffalo</s1>
<s2>Amherst, NY 14260</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">État de New York</region>
<settlement type="city">Buffalo (New York)</settlement>
</placeName>
<orgName type="university">Université d'État de New York à Buffalo</orgName>
<placeName><settlement type="city">Buffalo (New York)</settlement>
<region type="state">État de New York</region>
</placeName>
<orgName type="university" n="3">Université d'État de New York à Buffalo</orgName>
<orgName type="institution">Université d'État de New York</orgName>
</affiliation>
</author>
<author><name sortKey="Bhardwaj, Anurag" sort="Bhardwaj, Anurag" uniqKey="Bhardwaj A" first="Anurag" last="Bhardwaj">Anurag Bhardwaj</name>
<affiliation wicri:level="4"><inist:fA14 i1="02"><s1>Department of Computer Science and Engineering, University at Buffalo</s1>
<s2>Amherst, NY 14260</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">État de New York</region>
<settlement type="city">Buffalo (New York)</settlement>
</placeName>
<orgName type="university">Université d'État de New York à Buffalo</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">11-0343811</idno>
<date when="2011">2011</date>
<idno type="stanalyst">PASCAL 11-0343811 INIST</idno>
<idno type="RBID">Pascal:11-0343811</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000124</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000649</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000092</idno>
<idno type="wicri:doubleKey">1433-2833:2011:Huaigu Cao:unconstrained:handwritten:document</idno>
<idno type="wicri:Area/Main/Merge">000544</idno>
<idno type="wicri:Area/Main/Curation">000538</idno>
<idno type="wicri:Area/Main/Exploration">000538</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Unconstrained handwritten document retrieval</title>
<author><name sortKey="Huaigu Cao" sort="Huaigu Cao" uniqKey="Huaigu Cao" last="Huaigu Cao">HUAIGU CAO</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Raytheon BBN Technologies</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Govindaraju, Venu" sort="Govindaraju, Venu" uniqKey="Govindaraju V" first="Venu" last="Govindaraju">Venugopal Govindaraju</name>
<affiliation wicri:level="4"><inist:fA14 i1="02"><s1>Department of Computer Science and Engineering, University at Buffalo</s1>
<s2>Amherst, NY 14260</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">État de New York</region>
<settlement type="city">Buffalo (New York)</settlement>
</placeName>
<orgName type="university">Université d'État de New York à Buffalo</orgName>
<placeName><settlement type="city">Buffalo (New York)</settlement>
<region type="state">État de New York</region>
</placeName>
<orgName type="university" n="3">Université d'État de New York à Buffalo</orgName>
<orgName type="institution">Université d'État de New York</orgName>
</affiliation>
</author>
<author><name sortKey="Bhardwaj, Anurag" sort="Bhardwaj, Anurag" uniqKey="Bhardwaj A" first="Anurag" last="Bhardwaj">Anurag Bhardwaj</name>
<affiliation wicri:level="4"><inist:fA14 i1="02"><s1>Department of Computer Science and Engineering, University at Buffalo</s1>
<s2>Amherst, NY 14260</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">État de New York</region>
<settlement type="city">Buffalo (New York)</settlement>
</placeName>
<orgName type="university">Université d'État de New York à Buffalo</orgName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint><date when="2011">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Bootstrapping</term>
<term>Character recognition</term>
<term>Database query</term>
<term>Defect</term>
<term>Document retrieval</term>
<term>Handwriting recognition</term>
<term>Image analysis</term>
<term>Information retrieval</term>
<term>Information system</term>
<term>Linguistics</term>
<term>Manuscript character</term>
<term>Metric</term>
<term>Modeling</term>
<term>Natural language</term>
<term>Optical character recognition</term>
<term>Performance evaluation</term>
<term>Segmentation</term>
<term>Taxonomy</term>
<term>Text</term>
<term>Vector space</term>
<term>World wide web</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Caractère manuscrit</term>
<term>Recherche documentaire</term>
<term>Recherche information</term>
<term>Système information</term>
<term>Interrogation base donnée</term>
<term>Reconnaissance optique caractère</term>
<term>Texte</term>
<term>Linguistique</term>
<term>Langage naturel</term>
<term>Systématique</term>
<term>Reconnaissance caractère</term>
<term>Analyse image</term>
<term>Réseau web</term>
<term>Défaut</term>
<term>Evaluation performance</term>
<term>Autogénération mutuelle</term>
<term>Métrique</term>
<term>Segmentation</term>
<term>Espace vectoriel</term>
<term>Modélisation</term>
<term>Reconnaissance écriture</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Recherche documentaire</term>
<term>Linguistique</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">With the ever-increasing growth of the World Wide Web, there is an urgent need for an efficient information retrieval system that can search and retrieve handwritten documents when presented with user queries. However, unconstrained handwriting recognition remains a challenging task with inadequate performance thus proving to be a major hurdle in providing robust search experience in handwritten documents. In this paper, we describe our recent research with focus on information retrieval from noisy text derived from imperfect handwriting recognizers. First, we describe a novel term frequency estimation technique incorporating the word segmentation information inside the retrieval framework to improve the overall system performance. Second, we outline a taxonomy of different techniques used for addressing the noisy text retrieval task. The first method uses a novel bootstrapping mechanism to refine the OCR'ed text and uses the cleaned text for retrieval. The second method uses the uncorrected or raw OCR'ed text but modifies the standard vector space model for handling noisy text issues. The third method employs robust image features to index the documents instead of using noisy OCR'ed text. We describe these techniques in detail and also discuss their performance measures using standard IR evaluation metrics.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>Massachusetts</li>
<li>État de New York</li>
</region>
<settlement><li>Buffalo (New York)</li>
</settlement>
<orgName><li>Université d'État de New York</li>
<li>Université d'État de New York à Buffalo</li>
</orgName>
</list>
<tree><country name="États-Unis"><region name="Massachusetts"><name sortKey="Huaigu Cao" sort="Huaigu Cao" uniqKey="Huaigu Cao" last="Huaigu Cao">HUAIGU CAO</name>
</region>
<name sortKey="Bhardwaj, Anurag" sort="Bhardwaj, Anurag" uniqKey="Bhardwaj A" first="Anurag" last="Bhardwaj">Anurag Bhardwaj</name>
<name sortKey="Govindaraju, Venu" sort="Govindaraju, Venu" uniqKey="Govindaraju V" first="Venu" last="Govindaraju">Venugopal Govindaraju</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000538 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000538 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= Pascal:11-0343811 |texte= Unconstrained handwritten document retrieval }}
This area was generated with Dilib version V0.6.32.